Abstract

We introduce Natural Neural Networks, a novel family of algorithms that speed up 
convergence by adapting their internal representation during training to improve 
conditioning of the Fisher matrix. In particular, we show a specific example that 
employs a simple and efficient reparametrization of the neural network weights by 
implicitly whitening the representation obtained at each layer, while preserving 
the feed-forward computation of the network. Such networks can be trained efficiently
via the proposed Projected Natural Gradient Descent algorithm (PRONG),
which amortizes the cost of these reparametrizations over many parameter updates
and is closely related to the Mirror Descent online learning algorithm. We
highlight the benefits of our method on both unsupervised and supervised learning
tasks, and showcase its scalability by training on the large-scale ImageNet
Challenge dataset. 


1 Introduction 

Deep networks have proven extremely successful across a broad range of applications. While their
deep and complex structure affords them a rich modeling capacity, it also creates complex
dependencies between the parameters which can make learning difficult via first-order stochastic
gradient descent (SGD). As long as SGD remains the workhorse of deep learning, our ability to
extract high-level representations from data may be hindered by difficult optimization, as
evidenced by the boost in performance offered by batch normalization (BN) [7] on the Inception
architecture [25].

Though its adoption remains limited, the natural gradient [?] appears ideally suited to these difficult
optimization issues. By following the direction of steepest descent on the probabilistic manifold,
the natural gradient can make constant progress over the course of optimization, as measured by the
Kullback-Leibler (KL) divergence between consecutive iterates. Utilizing the proper distance measure
ensures that the natural gradient is invariant to the parametrization of the model. Unfortunately,
its application has been limited due to its high computational cost. Natural gradient descent (NGD)
typically requires an estimate of the Fisher Information Matrix (FIM), which is square in the number
of parameters, and worse, it requires computing its inverse. Truncated Newton methods can avoid
explicitly forming the FIM in memory [?], but they require an expensive iterative procedure to
compute the inverse. Such computations can be wasteful and do not take into account the smooth
change of the Fisher during optimization or the highly structured nature of deep models.

Inspired by recent work on model reparametrizations [?], our approach starts with a simple
question: can we devise a neural network architecture whose Fisher is constrained to be
identity? This is an important question, as SGD and NGD would be equivalent in the resulting
model. The main contribution of this paper is in providing a simple, theoretically justified network
reparametrization which approximates, via first-order gradient descent, a block-diagonal natural
gradient update over layers. Our method is computationally efficient due to the local nature of the
reparametrization, based on whitening, and the amortized nature of the algorithm. Our second
contribution is in unifying many heuristics commonly used for training neural networks, under the roof
of the natural gradient, while highlighting an important connection between model
reparametrizations and Mirror Descent [?]. Finally, we showcase the efficiency and the scalability of
our method across a broad range of experiments, scaling our method from standard deep auto-encoders
to large convolutional models on ImageNet [20], trained across multiple GPUs. This is to our
knowledge the first time a (non-diagonal) natural gradient algorithm is scaled to problems of this
magnitude.


2 The Natural Gradient 


This section provides the necessary background and derives a particular form of the FIM whose 
structure will be key to our efficient approximation. While we tailor the development of our method 
to the classification setting, our approach generalizes to regression and density estimation. 


2.1 Overview 

We consider the problem of fitting the parameters $\theta \in \mathbb{R}^N$ of a model
$p(y \mid x; \theta)$ to an empirical distribution $\pi(x, y)$ under the log-loss. We denote by
$x \in \mathcal{X}$ the observation vector and $y \in \mathcal{Y}$ its associated label. Concretely,
this stochastic optimization problem aims to solve:

$$\theta^* \in \operatorname{argmin}_\theta \; \mathbb{E}_{(x,y)\sim\pi}\left[-\log p(y \mid x, \theta)\right]. \quad (1)$$


Defining the per-example loss as $\ell(x, y)$, Stochastic Gradient Descent (SGD) performs the above
minimization by iteratively following the direction of steepest descent, given by the column vector
$\nabla = \mathbb{E}_\pi[d\ell / d\theta]$. Parameters are updated using the rule
$\theta^{(t+1)} \leftarrow \theta^{(t)} - \alpha^{(t)} \nabla^{(t)}$, where $\alpha$ is a
learning rate. An equivalent proximal form of gradient descent [?] reveals the precise nature of $\alpha$:

$$\theta^{(t+1)} = \operatorname{argmin}_\theta \left\{ \langle \theta, \nabla^{(t)} \rangle + \frac{1}{2\alpha^{(t)}} \left\| \theta - \theta^{(t)} \right\|_2^2 \right\}. \quad (2)$$

Namely, each iterate is the solution to an auxiliary optimization problem, where $\alpha$ controls
the distance between consecutive iterates, using an $L_2$ distance. In contrast, the natural gradient
relies on the KL-divergence between iterates, a more appropriate distance measure for probability
distributions. Its metric is determined by the Fisher Information matrix,


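$$F_\theta = \mathbb{E}_{x\sim\pi}\left\{ \mathbb{E}_{y\sim p(y \mid x,\theta)}\left[ \left(\frac{\partial \log p}{\partial \theta}\right) \left(\frac{\partial \log p}{\partial \theta}\right)^T \right] \right\}, \quad (3)$$

i.e. the covariance of the gradients of the model log-probabilities wrt. its parameters. The natural
gradient direction is then obtained as $\nabla_N = F_\theta^{-1} \nabla$. See [15, 14] for a recent
overview of the topic.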
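As an illustration of the natural gradient direction defined above, here is a minimal NumPy sketch for a logistic regression model, a case where the Fisher has a simple closed form. The toy model, the function names and the small damping term added before solving are assumptions made for this example only; they are not part of the method described in this paper.

```python
import numpy as np

def log_loss(theta, X, y):
    """Mean negative log-likelihood of the model p(y=1|x) = sigmoid(x . theta)."""
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def gradient(theta, X, y):
    """Gradient of the mean log-loss with respect to theta."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) / len(y)

def fisher(theta, X, damping=1e-3):
    """Fisher matrix E_x[p (1 - p) x x^T]; uses E_{y~p}[(y - p)^2] = p (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    F = (X * (p * (1.0 - p))[:, None]).T @ X / len(X)
    return F + damping * np.eye(X.shape[1])  # damping is an added assumption

def natural_gradient_step(theta, X, y, lr=0.1):
    """theta <- theta - lr * F^{-1} grad, solving a linear system instead of inverting F."""
    return theta - lr * np.linalg.solve(fisher(theta, X), gradient(theta, X, y))
```

Badly scaled inputs are where the difference with a plain gradient step shows: the Fisher rescales each direction by its curvature, so a single learning rate can serve all directions.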


2.2 Fisher Information Matrix for MLPs 

We start by deriving the precise form of the Fisher for a canonical multi-layer perceptron (MLP) 
composed of L layers. We consider the following deep network for binary classification, though our 
approach generalizes to an arbitrary number of output classes. 

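$$\begin{aligned}
p(y = 1 \mid x) \equiv h_L &= f_L(W_L h_{L-1} + b_L) \\
&\;\;\vdots \\
h_1 &= f_1(W_1 x + b_1)
\end{aligned} \quad (4)$$

The parameters of the MLP, denoted $\theta = \{W_1, b_1, \cdots, W_L, b_L\}$, are the weights
$W_i \in \mathbb{R}^{N_i \times N_{i-1}}$ connecting layers $i$ and $i - 1$, and the biases
$b_i \in \mathbb{R}^{N_i}$. $f_i$ is an element-wise non-linear function.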


Let us define $\delta_i$ to be the backpropagated gradient through the $i$-th non-linearity. We
ignore the off block-diagonal components of the Fisher matrix and focus on the block $F_{W_i}$,
corresponding to interactions between parameters of layer $i$. This block takes the form:

$$F_{W_i} = \mathbb{E}_{x\sim\pi,\; y\sim p}\left[ \mathrm{vec}\left(\delta_i h_{i-1}^T\right) \mathrm{vec}\left(\delta_i h_{i-1}^T\right)^T \right],$$

where vec(X) is the vectorization function yielding a column vector from the rows of matrix X.


Figure 1: (a) A 2-layer natural neural network. (b) Illustration of the projections involved in PRONG.

Assuming that $\delta_i$ and activations $h_{i-1}$ are independent random variables, we can write:

$$F_{W_i}(km, ln) \approx \mathbb{E}_{x\sim\pi,\; y\sim p}\left[\delta_i(k)\,\delta_i(l)\right] \; \mathbb{E}_\pi\left[h_{i-1}(m)\,h_{i-1}(n)\right], \quad (5)$$

where $X(i, j)$ is the element at row $i$ and column $j$ of matrix $X$ and $x(i)$ is the $i$-th
element of vector $x$. $F_{W_i}(km, ln)$ is the entry in the Fisher capturing interactions between
parameters $W_i(k, m)$ and $W_i(l, n)$. Our hypothesis, verified experimentally in Sec. 4.1, is that
we can greatly improve conditioning of the Fisher by enforcing that $\mathbb{E}_\pi[h_i h_i^T] = I$
for all layers of the network, despite ignoring possible correlations in the $\delta$'s and off
block-diagonal terms of the Fisher.
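The factorization of Eq. 5 can be checked numerically. The sketch below, written with NumPy, compares a Monte-Carlo estimate of the full block with the Kronecker product of the two second-moment matrices. It relies on the row-wise vec convention stated above, under which the vectorized outer product of $\delta_i$ and $h_{i-1}$ is their Kronecker product. The function names and the synthetic samples are illustrative assumptions.

```python
import numpy as np

def fisher_block_exact(deltas, acts):
    """Monte-Carlo estimate of E[vec(delta h^T) vec(delta h^T)^T] with row-wise vec.

    deltas: (num_samples, N_i), acts: (num_samples, N_{i-1}).
    """
    # Row-major flattening of the outer product delta h^T, one row per sample.
    g = np.einsum("nk,nm->nkm", deltas, acts).reshape(len(deltas), -1)
    return g.T @ g / len(g)

def fisher_block_factored(deltas, acts):
    """Approximation of Eq. 5: E[delta delta^T] (Kronecker product) E[h h^T]."""
    dd = deltas.T @ deltas / len(deltas)
    hh = acts.T @ acts / len(acts)
    # Entry (k*M + m, l*M + n) of the result equals dd[k, l] * hh[m, n].
    return np.kron(dd, hh)
```

When the second moment of the activations is the identity, the factored block reduces to a Kronecker product with the identity, which is the situation the whitening described in the next section aims for.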


3 Projected Natural Gradient Descent 

This section introduces Whitened Neural Networks (WNN), which perform approximate whitening
of their internal hidden representations. We begin by presenting a novel whitened neural layer,
with the assumption that the network statistics $\mu_i(\theta) = \mathbb{E}_\pi[h_i]$ and
$\Sigma_i(\theta) = \mathbb{E}_\pi[h_i h_i^T]$ are fixed.

We then show how these layers can be adapted to efficiently track population statistics over the
course of training. The resulting learning algorithm is referred to as Projected Natural Gradient
Descent (PRONG). We highlight an interesting connection between PRONG and Mirror Descent in
Section 3.3.

3.1 A Whitened Neural Layer 

The building block of WNN is the following neural layer, 

$$h_i = f_i\left(V_i U_{i-1}\left(h_{i-1} - c_i\right) + d_i\right). \quad (6)$$

Compared to Eq. 4, we have introduced an explicit centering parameter $c_i = \mu_i$, which ensures
that the input to the dot product has zero mean in expectation. This is analogous to the centering
reparametrization for Deep Boltzmann Machines [?]. The weight matrix
$U_{i-1} \in \mathbb{R}^{N_{i-1} \times N_{i-1}}$ is a per-layer ZCA-whitening matrix whose rows are
obtained from an eigen-decomposition of $\Sigma_{i-1}$:

$$\Sigma_i = \tilde{U}_i \cdot \mathrm{diag}(\lambda_i) \cdot \tilde{U}_i^T \;\implies\; U_i = \mathrm{diag}(\lambda_i + \epsilon)^{-\frac{1}{2}} \cdot \tilde{U}_i^T. \quad (7)$$

The hyper-parameter $\epsilon$ is a regularization term controlling the maximal multiplier on the
learning rate, or equivalently the size of the trust region. The parameters
$V_i \in \mathbb{R}^{N_i \times N_{i-1}}$ and $d_i \in \mathbb{R}^{N_i}$ are analogous to the
canonical parameters of a neural network as introduced in Eq. 4, though they operate in the space of
whitened unit activations $U_i(h_i - c_i)$. This layer can be stacked to form a deep neural network
having $L$ layers, with model parameters $\Omega = \{V_1, d_1, \cdots, V_L, d_L\}$ and whitening
coefficients $\Phi = \{U_0, c_0, \cdots, U_{L-1}, c_{L-1}\}$, as depicted in Fig. 1a.

Though the above layer might appear over-parametrized at first glance, we crucially do not learn
the whitening coefficients via loss minimization, but instead estimate them directly from the model
statistics. These coefficients are thus constants from the point of view of the optimizer and simply
serve to improve conditioning of the Fisher with respect to the parameters $\Omega$, denoted
$F_\Omega$. Indeed, using the same derivation that led to Eq. 5, we can see that the block-diagonal
terms of $F_\Omega$ now involve terms $\mathbb{E}\left[(U_i h_i)(U_i h_i)^T\right]$, which equals
identity by construction.
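The whitened layer of Eq. 6 and the whitening matrix of Eq. 7 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the covariance is estimated from centered samples, eigenvalues are clipped at zero for numerical safety, and the function names and the default tanh non-linearity are choices made for the example.

```python
import numpy as np

def whitening_coefficients(H, eps=1e-5):
    """Estimate (U, c) from activations H of shape (num_samples, num_units), as in Eq. 7."""
    c = H.mean(axis=0)                          # centering vector
    centered = H - c
    sigma = centered.T @ centered / len(H)      # covariance of the centered samples (assumption)
    lam, U_tilde = np.linalg.eigh(sigma)        # columns of U_tilde are eigenvectors
    lam = np.maximum(lam, 0.0)                  # guard against tiny negative eigenvalues
    U = (U_tilde / np.sqrt(lam + eps)).T        # diag(lam + eps)^{-1/2} U_tilde^T
    return U, c

def whitened_layer(h_prev, V, d, U, c, f=np.tanh):
    """Eq. 6 applied to a batch: rows of h_prev are samples."""
    return f((h_prev - c) @ U.T @ V.T + d)
```

With a small `eps`, the whitened inputs `(H - c) @ U.T` have a covariance close to the identity on the samples used for the estimate, which is the property exploited in the argument above.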

3.2 Updating the Whitening Coefficients 

As the whitened model parameters $\Omega$ evolve during training, so do the statistics $\mu_i$ and
$\Sigma_i$. For our model to remain well conditioned, the whitening coefficients must be updated at
regular intervals,

Algorithm 1 Projected Natural Gradient Descent
 1: Input: training set $\mathcal{D}$, initial parameters $\theta$.
 2: Hyper-parameters: reparam. frequency $T$, number of samples $N_s$, regularization term $\epsilon$.
 3: $U_i \leftarrow I$; $c_i \leftarrow 0$; $t \leftarrow 0$
 4: repeat
 5:   if mod$(t, T) = 0$ then    ▷ amortize cost of lines [6-11]
 6:     for all layers $i$ do
 7:       Compute canonical parameters $W_i = V_i U_{i-1}$; $b_i = d_i + W_i c_i$.    ▷ proj. $P_\Phi^{-1}(\Omega)$
 8:       Estimate $\mu_i$ and $\Sigma_i$, using $N_s$ samples from $\mathcal{D}$.
 9:       Update $c_i$ from $\mu_i$ and $U_i$ from eigen decomp. of $\Sigma_i + \epsilon I$.    ▷ update $\Phi$
10:       Update parameters $V_i \leftarrow W_i U_{i-1}^{-1}$; $d_i \leftarrow b_i - V_i c_i$.    ▷ proj. $P_\Phi(\theta)$
11:     end for
12:   end if
13:   Perform SGD update wrt. $\Omega$ using samples from $\mathcal{D}$.
14:   $t \leftarrow t + 1$
15: until convergence


while taking care not to interfere with the convergence properties of gradient descent. This can be
achieved by coupling updates to $\Phi$ with corresponding updates to $\Omega$ such that the overall
function implemented by the MLP remains unchanged, e.g. by preserving the product $V_i U_{i-1}$
before and after each update to the whitening coefficients (with an analogous constraint on the biases).
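The function-preserving constraint can be illustrated for a single layer. The NumPy sketch below maps whitened parameters to canonical ones and back for a new pair of whitening coefficients. The bias signs are obtained by expanding Eq. 6, which gives a canonical bias of $d - Wc$; that convention is an assumption of this sketch where it differs from the signs printed in Algorithm 1, and the function names are hypothetical.

```python
import numpy as np

def to_canonical(V, d, U, c):
    """Whitened -> canonical parameters: V U (h - c) + d = W h + b."""
    W = V @ U
    b = d - W @ c            # sign follows from expanding Eq. 6 (assumption, see lead-in)
    return W, b

def to_whitened(W, b, U_new, c_new):
    """Canonical -> whitened parameters for new coefficients, keeping W h + b unchanged."""
    V = W @ np.linalg.inv(U_new)   # preserves the product V U = W
    d = b + W @ c_new              # analogous constraint on the bias
    return V, d
```

Applying `to_canonical` before the whitening coefficients change and `to_whitened` afterwards leaves the pre-activation of the layer, and hence the network output, unchanged for every input.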

Unfortunately, while estimating the mean $\mu_i$ and $\mathrm{diag}(\Sigma_i)$ could be performed
online over a mini-batch of samples as in the recent Batch Normalization scheme [7], estimating the
full covariance matrix will undoubtedly require a larger number of samples. While statistics could be
accumulated online via an exponential moving average as in RMSprop [27] or K-FAC [?], the cost of the
eigen-decomposition required for computing the whitening matrix $U_i$ remains cubic in the layer size.

In the simplest instantiation of our method, we exploit the smoothness of gradient descent by simply
amortizing the cost of these operations over $T$ consecutive updates. SGD updates in the whitened
model will be closely aligned to NGD immediately following the reparametrization. The quality
of this approximation will degrade over time, until the subsequent reparametrization. The resulting
algorithm is shown in the pseudo-code of Algorithm 1. We can improve upon this basic amortization
scheme by including a diagonal scaling of $U_i$ based on the standard deviation of layer $i$
activations, after each gradient update, thus mimicking the effect of a diagonal natural gradient
method. For this update to be valid, this enhanced version of the method, denoted PRONG+, scales the
rows of $V_i$ accordingly so as to preserve the feed-forward computation of the network. This can be
implemented by combining PRONG with batch normalization.
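The amortized procedure can be sketched end to end on a small fully connected network. The NumPy code below is a minimal illustration of basic PRONG, not the implementation used in the experiments: the network (tanh hidden layers, sigmoid output, binary log-loss), all function names and all default hyper-parameter values are assumptions. As in Eq. 6, the bias conversion uses $b = d - Wc$, which may differ from the signs printed in Algorithm 1, and the covariance is estimated from centered samples.

```python
import numpy as np

def act(z, last):
    """Sigmoid on the output layer, tanh elsewhere."""
    return 1.0 / (1.0 + np.exp(-z)) if last else np.tanh(z)

def whiten_stats(H, eps):
    """Whitening matrix U (Eq. 7) and centering vector c from samples H."""
    c = H.mean(axis=0)
    lam, Ut = np.linalg.eigh((H - c).T @ (H - c) / len(H))
    return (Ut / np.sqrt(np.maximum(lam, 0.0) + eps)).T, c

def forward(X, params):
    """Return [x, h_1, ..., h_L] for layers stored as tuples (V, d, U, c)."""
    hs = [X]
    for k, (V, d, U, c) in enumerate(params):
        hs.append(act((hs[-1] - c) @ U.T @ V.T + d, k == len(params) - 1))
    return hs

def reparametrize(X, params, eps):
    """Refresh (U, c) layer by layer while keeping the network function fixed."""
    h, new = X, []
    for k, (V, d, U, c) in enumerate(params):
        W = V @ U                        # canonical weights
        b = d - W @ c                    # canonical bias, from expanding Eq. 6
        U, c = whiten_stats(h, eps)      # new whitening coefficients for this layer's input
        new.append((W @ np.linalg.inv(U), b + W @ c, U, c))
        h = act(h @ W.T + b, k == len(params) - 1)
    return new

def sgd_step(X, y, params, lr):
    """Gradient step on (V, d) only; (U, c) are constants for the optimizer."""
    hs = forward(X, params)
    dz = (hs[-1] - y) / len(X)           # gradient of mean log-loss wrt last pre-activation
    new = list(params)
    for k in reversed(range(len(params))):
        V, d, U, c = params[k]
        a = (hs[k] - c) @ U.T            # whitened input of layer k
        new[k] = (V - lr * dz.T @ a, d - lr * dz.sum(axis=0), U, c)
        if k > 0:
            dz = (dz @ V @ U) * (1.0 - hs[k] ** 2)   # back through the tanh below
    return new

def prong(X, y, sizes, steps=300, T=10, n_s=100, eps=0.1, lr=0.05, batch=64, seed=0):
    """Basic PRONG loop: reparametrize every T updates, plain SGD in between."""
    rng = np.random.default_rng(seed)
    params = [(0.1 * rng.normal(size=(m, n)), np.zeros(m), np.eye(n), np.zeros(n))
              for n, m in zip(sizes[:-1], sizes[1:])]
    for t in range(steps):
        if t % T == 0:                   # amortized reparametrization
            idx = rng.choice(len(X), size=min(n_s, len(X)), replace=False)
            params = reparametrize(X[idx], params, eps)
        idx = rng.choice(len(X), size=batch, replace=False)
        params = sgd_step(X[idx], y[idx], params, lr)
    return params
```

Here `y` is expected as a column of 0/1 labels with shape `(num_samples, 1)`. The diagonal rescaling of PRONG+ is not included in this sketch.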


3.3 Duality and Mirror Descent 

There is an inherent duality between the parameters $\Omega$ of our whitened neural layer and the
parameters $\theta$ of a canonical model. Indeed, there exist linear projections $P_\Phi(\theta)$ and
$P_\Phi^{-1}(\Omega)$, which map from canonical parameters $\theta$ to whitened parameters $\Omega$,
and vice-versa. $P_\Phi(\theta)$ corresponds to line 10 of Algorithm 1, while $P_\Phi^{-1}(\Omega)$
corresponds to line 7. This duality between $\theta$ and $\Omega$ reveals a close connection between
PRONG and Mirror Descent [?].

Mirror Descent (MD) is an online learning algorithm which generalizes the proximal form of gradient
descent to the class of Bregman divergences $B_\psi(q, p)$, where $q, p \in \Gamma$ and
$\psi : \Gamma \to \mathbb{R}$ is a strictly convex and differentiable function. Replacing the $L_2$
distance by $B_\psi$, mirror descent solves the proximal problem of Eq. 2 by applying first-order
updates in a dual space and then projecting back onto the primal space. Defining
$\Omega = \nabla_\theta \psi(\theta)$ and $\theta = \nabla_\Omega \psi^*(\Omega)$, with $\psi^*$ the
convex conjugate of $\psi$, the mirror descent updates are given by:

$$\Omega^{(t+1)} = \nabla_\theta \psi(\theta^{(t)}) - \alpha^{(t)} \nabla, \qquad \theta^{(t+1)} = \nabla_\Omega \psi^*(\Omega^{(t+1)}). \quad (8)$$

Figure 2: Fisher matrix for a small MLP (a) before and (b) after the first reparametrization. Best
viewed in colour. (c) Condition number of the FIM during training, relative to the initial conditioning.

It is well known [?] that the natural gradient is a special case of MD, where the distance
generating function^1 is chosen to be $\psi(\theta) = \frac{1}{2} \theta^T F \theta$.


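The mirror updates are somewhat unintuitive however. Why is the gradient applied to the dual
space if it has been computed in the space of parameters $\theta$? This is where PRONG relates to MD.
It is trivial to show that using the function $\tilde{\psi}(\theta) = \frac{1}{2} \theta^T \sqrt{F}\, \theta$
instead of the previously defined $\psi$ enables us to directly update the dual parameters using
$\nabla_\Omega$, the gradient computed directly in the dual space. Indeed, the resulting updates can
be shown to implement the natural gradient and are thus equivalent to the updates of Eq. 8 with the
appropriate choice of $\psi(\theta)$:

$$\begin{aligned}
\tilde{\Omega}^{(t+1)} &= \nabla_\theta \tilde{\psi}(\theta^{(t)}) - \alpha^{(t)} \nabla_\Omega = F^{\frac{1}{2}} \theta^{(t)} - \alpha^{(t)} F^{-\frac{1}{2}}\, \mathbb{E}_\pi\left[\frac{d\ell}{d\theta}\right] \\
\tilde{\theta}^{(t+1)} &= \nabla_\Omega \tilde{\psi}^*(\tilde{\Omega}^{(t+1)}) = \theta^{(t)} - \alpha^{(t)} F^{-1}\, \mathbb{E}_\pi\left[\frac{d\ell}{d\theta}\right]
\end{aligned} \quad (10)$$

The operators $\nabla\tilde{\psi}$ and $\nabla\tilde{\psi}^*$ correspond to the projections
$P_\Phi(\theta)$ and $P_\Phi^{-1}(\Omega)$ used by PRONG to map from the canonical neural parameters
$\theta$ to those of the whitened layers $\Omega$. As illustrated in Fig. 1b, the advantage of this
whitened form of MD is that one may amortize the cost of the projections over several updates, as
gradients can be computed directly in the dual parameter space.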
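The equivalence between the whitened form of MD and the natural gradient can be verified numerically. The NumPy sketch below assumes a quadratic distance generating function whose gradient is $F^{1/2}\theta$, takes one gradient step in the dual space and maps the result back. The function names are hypothetical and $F$ is any symmetric positive definite matrix standing in for the Fisher.

```python
import numpy as np

def sqrtm_psd(F):
    """Symmetric square root of a symmetric positive definite matrix."""
    lam, Q = np.linalg.eigh(F)
    return (Q * np.sqrt(lam)) @ Q.T

def mirror_step(theta, grad_theta, F, lr):
    """One whitened mirror-descent step: update in the dual space, then map back."""
    F_half = sqrtm_psd(F)
    F_half_inv = np.linalg.inv(F_half)
    omega = F_half @ theta                   # dual parameters
    grad_omega = F_half_inv @ grad_theta     # chain rule, since theta = F^{-1/2} omega
    omega = omega - lr * grad_omega          # plain gradient step in the dual space
    return F_half_inv @ omega                # back to the primal space
```

The returned value coincides with `theta - lr * np.linalg.solve(F, grad_theta)`, i.e. a natural gradient step, even though only a first-order update was applied in the dual space.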


3.4 Related Work 

This work extends the recent contributions of [?] in formalizing many commonly used heuristics
for training MLPs: the importance of zero-mean activations and gradients [?], as well as the
importance of normalized variances in the forward and backward passes [?]. More recently,
Vatanen et al. [28] extended their previous work [?] by introducing a multiplicative constant
$\gamma_i$ to the centered non-linearity. In contrast, we introduce a full whitening matrix $U_i$ and
focus on whitening the feedforward network activations, instead of normalizing a geometric mean over
units and gradient variances.

The recently introduced batch normalization (BN) scheme [7] quite closely resembles a diagonal
version of PRONG, the main difference being that BN normalizes the variance of activations before
the non-linearity, as opposed to normalizing the latent activations by looking at the full covariance.
Furthermore, BN implements normalization by modifying the feed-forward computations, thus
requiring the method to backpropagate through the normalization operator. A diagonal version of
PRONG also bears an interesting resemblance to RMSprop [27, 5], in that both normalization terms
involve the square root of the FIM. An important distinction however is that PRONG applies this
update in the whitened parameter space, thus preserving the natural gradient interpretation.

K-FAC [?] is also closely related to PRONG and was developed concurrently to our method. In
one of its implementations, it targets the same block diagonal as PRONG while also exploiting
the low rank structure of these blocks for efficiency, reminiscent of TONGA [19]. Their method
however operates online via low-rank updates to each block, similar to the preconditioning used in
the Kaldi speech recognition toolkit [?]. This is in contrast to our approach based on amortization.
They also consider the covariance of the backpropagated gradients $\delta_i$, while PRONG only looks
at the covariance of activations $h_i$. K-FAC further proposes a tri-diagonal variant which
decorrelates gradients across neighboring layers, though resulting in a more complex algorithm.

A similar algorithm to PRONG was later found in [23], where it appeared simply as a thought
experiment, but with no amortization or recourse for efficiently computing $F$.

^1 As the Fisher and thus $\psi_\theta$ depend on the parameters $\theta^{(t)}$, these should be
indexed with a time superscript, which we drop for clarity.

Figure 3: Optimizing a deep auto-encoder on MNIST. (a) Impact of eigenvalue regularization term
$\epsilon$. (b) Impact of amortization period $T$, showing that initialization with the whitening
reparametrization is important for achieving faster learning and better error rate. (c) Training
error vs number of updates. (d) Training error vs cpu-time. Plots (c) and (d) show that PRONG
achieves better error rate both in number of updates and wall clock time.


4 Experiments 

We begin with a set of diagnostic experiments which highlight the effectiveness of our method at
improving conditioning. We also illustrate the impact of the hyper-parameters $T$ and $\epsilon$,
controlling the frequency of the reparametrization and the size of the trust region. Section 4.2
evaluates PRONG on unsupervised learning problems, where models are both deep and fully connected.
Section 4.3 then moves onto large convolutional models for image classification.

4.1 Introspective Experiments 

Conditioning. To provide a better understanding of the approximation made by PRONG, we train
a small 3-layer MLP with tanh non-linearities, on a downsampled version of MNIST (10x10) [?].
The model size was chosen in order for the full Fisher to be tractable. Fig. 2 (a-b) shows the FIM
of the middle hidden layers before and after whitening the model activations (we took the absolute
value of the entries to improve visibility). Fig. 2c depicts the evolution of the condition number
of the FIM during training, measured as a percentage of its initial value (before the first whitening
reparametrization in the case of PRONG). We present such curves for SGD, RMSprop and PRONG.
The results clearly show that the reparametrization performed by PRONG improves conditioning
(reduction of more than 95%). These observations confirm our initial assumption, namely that we
can improve conditioning of the block diagonal Fisher by whitening activations alone.
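The relative condition number used in this diagnostic is simple to compute. The NumPy sketch below measures it on the Kronecker-factored block of Eq. 5, before and after whitening a set of activations. The helper names and the synthetic data are assumptions, and the experiment above works with the full Fisher of a trained MLP rather than this factored stand-in.

```python
import numpy as np

def whiten(H, eps=1e-5):
    """Center and whiten activations H (rows are samples) following Eq. 7."""
    c = H.mean(axis=0)
    lam, Ut = np.linalg.eigh((H - c).T @ (H - c) / len(H))
    U = (Ut / np.sqrt(np.maximum(lam, 0.0) + eps)).T
    return (H - c) @ U.T

def factored_fisher(deltas, acts):
    """Kronecker-factored layer block, as in Eq. 5."""
    return np.kron(deltas.T @ deltas / len(deltas), acts.T @ acts / len(acts))

def relative_condition(F_now, F_init):
    """Condition number of F_now as a percentage of that of F_init."""
    return 100.0 * np.linalg.cond(F_now) / np.linalg.cond(F_init)
```

A value well below 100 indicates that the reparametrization improved conditioning, in the sense plotted in Fig. 2c.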

Sensitivity of Hyper-Parameters. Figures 3a-3b highlight the effect of the eigenvalue
regularization term $\epsilon$ and the reparametrization interval $T$. The experiments were performed
on the best performing auto-encoder of Section 4.2 on the MNIST dataset. Figures 3a-3b plot the
reconstruction error on the training set for various values of $\epsilon$ and $T$. As $\epsilon$
determines a maximum multiplier on the learning rate, learning becomes extremely sensitive when this
learning rate is high.^2 For smaller step sizes however, lowering $\epsilon$ can yield significant
speedups, often converging faster than simply using a larger learning rate. This confirms the
importance of the manifold curvature for optimization (lower $\epsilon$ allows for different
directions to be scaled drastically differently according to their corresponding curvature). Fig. 3b
compares the impact of $T$ for models having a proper whitened initialization (solid lines), to
models being initialized with a standard "fan-in" initialization (dashed lines) [10]. These results
are quite surprising in showing the effectiveness of the whitening reparametrization as a simple
initialization scheme. That being said, performance can degrade due to ill conditioning when $T$
becomes excessively large ($T = 10^{?}$).

4.2 Unsupervised Learning 

Following Martens [?], we evaluate PRONG on the task of minimizing reconstruction error of
an 8-layer auto-encoder on the MNIST dataset. The encoder is composed of 4 densely connected
sigmoidal layers, with a number of hidden units per layer in {1k, 500, 250, 30}, and a symmetric
(untied) decoder. Hyper-parameters were selected by grid search, based on training error,
with the following grid specifications: training batch size in {32, 64, 128, 256}, learning rates in
$\{10^{-?}, 10^{-?}, 10^{-?}\}$ and momentum term in {0, 0.9}. For RMSprop, we further tuned the
moving average coefficient in {0.99, 0.999} and the regularization term controlling the maximum
scaling factor in {0.1, 0.01}. For PRONG, we fixed the natural reparametrization to $T = 10^{?}$, using
$N_s = 100$ samples (i.e. they were not optimized for wallclock time). Reconstruction error with
respect to updates and wallclock time are shown in Fig. 3 (c,d).

We can see that PRONG significantly outperforms the baseline methods, by up to an order of
magnitude in number of updates. With respect to wallclock, our method significantly outperforms the
baselines in terms of time taken to reach a certain error threshold, despite the fact that the runtime
per epoch for PRONG was 3.2x that of SGD, compared to batch normalization (2.3x SGD) and
RMSprop (9x SGD). Note that these timing numbers reflect performance under the optimal choice of
hyper-parameters, which in the case of batch normalization yielded a batch size of 256, compared to
128 for all other methods. Further breaking down the performance, 34% of the runtime of PRONG
was spent performing the whitening reparametrization, compared to 4% for estimating the per layer
means and covariances. This confirms that amortization is paramount to the success of our method.^3

4.3 Supervised Learning 

The next set of experiments addresses the problem of training deep supervised convolutional
networks for object recognition. Following [7], we perform whitening across feature maps only: that
is, we treat pixels in a given feature map as independent samples. This allows us to implement the
whitened neural layer as a sequence of two convolutions, where the first is by a 1x1 whitening filter.
PRONG is compared to SGD, RMSprop and batch normalization, with each algorithm being accelerated
via momentum. Results are presented on both CIFAR-10 [?] and the ImageNet Challenge
(ILSVRC12) datasets [?]. In both cases, learning rates were decreased using a "waterfall" annealing
schedule, which divided the learning rate by 10 when the validation error failed to improve after
a set number of evaluations.^4
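Whitening across feature maps, as described above, can be sketched with NumPy by treating every pixel of every image as one sample of a channel vector and applying the resulting matrix as a 1x1 filter. The `(batch, channels, height, width)` layout, the function names and the use of `einsum` in place of a convolution routine are assumptions made for this illustration.

```python
import numpy as np

def channel_whitening_filter(feats, eps=1e-3):
    """Estimate (U, c) over channels; feats has shape (batch, channels, height, width)."""
    C = feats.shape[1]
    samples = feats.transpose(0, 2, 3, 1).reshape(-1, C)   # every pixel is one sample
    c = samples.mean(axis=0)
    centered = samples - c
    lam, Ut = np.linalg.eigh(centered.T @ centered / len(samples))
    U = (Ut / np.sqrt(np.maximum(lam, 0.0) + eps)).T        # same form as Eq. 7
    return U, c

def apply_1x1(feats, U, c):
    """1x1 convolution: mix the channels with U at every spatial position."""
    return np.einsum("oc,bchw->bohw", U, feats - c[None, :, None, None])
```

The second convolution of the whitened layer, with the learned filters, would then be applied to the output of `apply_1x1`. The matrix `U` is only as large as the number of feature maps, which keeps the eigen-decomposition cheap compared to a fully connected layer of the same spatial size.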

4.3.1 CIFAR-10 

The model used for our CIFAR experiments consists of 8 convolutional layers, having 3x3 receptive
fields. 2x2 spatial max-pooling was applied between stacks of two convolutional layers, with the
exception of the last convolutional layer which computes the class scores and is followed by global
max-pooling and soft-max non-linearity. This particular choice of architecture was inspired by the
VGG model [22] and held fixed across all experiments. The number of filters per layer is as follows:
64, 64, 128, 128, 256, 256, 512, 10. The model was trained on 24 x 24 random crops with random
horizontal reflections. Model selection was performed on a held-out validation set of 5k examples.
Results are shown in Fig. 4.

^2 Unstable combinations of learning rates and $\epsilon$ are omitted for clarity.

^3 We note that our implementation of the whitening operations is not optimized, as it does not take
advantage of GPU acceleration, as opposed to the neural network computations. Therefore, runtime of
our method is expected to improve as we move the eigen-decompositions to GPU.

^4 On CIFAR-10, validation error was estimated every $10^{?}$ updates and the learning rate decreased
by a factor of 10 if the validation error failed to improve by 1% over 4 consecutive evaluations. For
ImageNet, we employed a more aggressive schedule which required that the validation error improves by
1% after each epoch.

Figure 4: Classification error on CIFAR-10 (a-b) and ImageNet (c-d), plotted against the number of
updates (1e4) in (a) and (c) and against time (h) in (b) and (d). On CIFAR-10, PRONG achieves better
test error and converges faster. On ImageNet, PRONG+ achieves comparable validation error while
maintaining a faster convergence rate.

With respect to training error, PRONG and batch normalization seem to offer similar speedups
compared to SGD with momentum. Our hypothesis is that the benefits of PRONG are more pronounced
for densely connected networks, where the number of units per layer is typically larger than the
number of maps used in convolutional networks. Interestingly, PRONG generalized better,
achieving 7.32% test error vs. 8.22% for batch normalization. This could reflect the findings of [?],
which showed how NGD can leverage unlabeled data for better generalization: the "unlabeled" data here
comes from the extra perturbations in the training set when estimating the whitening matrices.

4.3.2 ImageNet Challenge Dataset 

Our final set of experiments aims to show the scalability of our method: we thus apply our natural
gradient algorithm to the large-scale ILSVRC12 dataset (1.3M images labelled into 1000 categories)
using the Inception architecture [?]. In order to scale to problems of this size, we parallelized our
training loop so as to split the processing of a single minibatch (of size 256) across multiple GPUs.
Note that PRONG can scale well in this setting, as the estimation of the mean and covariance
parameters of each layer is also embarrassingly parallel. Eight GPUs were used for computing gradients
and estimating model statistics, though the eigen decomposition required for whitening was itself
not parallelized in the current implementation.

For all optimization algorithms, we considered initial learning rates in
$\{10^{-?}, 10^{-?}, 10^{-?}\}$ and used a value of 0.9 as the momentum coefficient. For PRONG we
tested reparametrization periods $T \in \{10, 10^{?}, 10^{?}, 10^{?}\}$, while typically using
$N_s = 0.1T$. Eigenvalues were regularized by adding a small constant
$\epsilon \in \{1, 10^{-?}, 10^{-?}, 10^{-?}\}$ before scaling the eigenvectors.^5 Given the
difficulty of the task, we employed the enhanced PRONG+ version of the algorithm, as simple periodic
whitening of the model proved to be unstable.^6

Figure 4 (c-d) shows that batch normalisation and PRONG+ converge to approximately the same
top-1 validation error (28.6% vs 28.9% respectively) for similar cpu-time. In comparison, SGD
achieved a validation error of 32.1%. PRONG+ however exhibits much faster convergence initially:
after $10^{?}$ updates it obtains around 36% error compared to 46% for BN alone. We stress that the
ImageNet results are somewhat preliminary. While our top-1 error is higher than reported in [7]
(25.2%), we used a much less extensive data augmentation pipeline. We are only beginning to
explore what natural gradient methods may achieve on these large scale optimization problems and
are encouraged by these initial findings.


^5 The grid was not searched exhaustively as the cost would have been prohibitive. As our main focus is
optimization, regularization consisted of a simple $L_2$ weight decay parameter of $10^{-?}$, with no
Dropout [?].

^6 This instability may have been compounded by momentum, which was initially not reset after each model
reparametrization when using standard PRONG.

5 Discussion


We began this paper by asking whether convergence speed could be improved by simple model
reparametrizations, driven by the structure of the Fisher matrix. From a theoretical and
experimental perspective, we have shown that Whitened Neural Networks can achieve this via a simple,
scalable and efficient whitening reparametrization. They are however one of several possible
instantiations of the concept of Natural Neural Networks. In a previous incarnation of the idea, we
exploited a similar reparametrization to include whitening of backpropagated gradients.^7 We
favor the simpler approach presented in this paper, as we generally found the alternative less stable
with deep networks. Ensuring zero-mean gradients also required the use of skip-connections, with
tedious book-keeping to offset the reparametrization of centered non-linearities [?].

Maintaining whitened activations may also offer additional benefits from the point of view of model
compression and generalization. By virtue of whitening, the projection $U_i h_i$ forms an ordered
representation, having least and most significant bits. The sharp roll-off in the eigenspectrum of
$\Sigma_i$ may explain why deep networks are amenable to compression [?]. Similarly, one could
envision spectral versions of Dropout [?] where the dropout probability is a function of the
eigenvalues. Alternative ways of orthogonalizing the representation at each layer should also be
explored, via alternate decompositions of $\Sigma_i$, or perhaps by exploiting the connection between
linear auto-encoders and PCA. We also plan on pursuing the connection with Mirror Descent and further
bridging the gap between deep learning and methods from online convex optimization.